ABSTRACT

The increasing public awareness of the negative health effects of exposure to peak air pollution levels, particularly for the most sensitive population sub-groups such as children and the elderly, has made short-term local forecasts of episodes of peak air pollutant concentrations increasingly necessary. The main objective of the present study is to develop a statistical model for predicting, a day in advance, the daily maximum 1-hour average ambient ground-level ozone concentration for Maun town, using the principal component regression (PCR) technique. The predictor variables are, on the one hand, the precursor air pollutants of ground-level ozone, namely nitrogen oxides and nitrogen dioxide, together with the previous day's ground-level ozone concentration and, on the other hand, meteorological variables: wind speed, wind direction, relative humidity, surface temperature, atmospheric pressure and solar radiation. The data consist of daily maximum 1-hour concentrations of the response and of each of these predictor variables, collected from 1 May 2014 to 31 August 2014. The biased regression method of PCR is applied to minimise the problem of multicollinearity, which is usually associated with multiple regression models. Multicollinearity is detected using the Pearson partial correlation matrix and the variance inflation factor (VIF). Model assessment tools include tests for the significance of individual regression coefficients in the PCR model, the coefficient of determination and the F test for the validity of the overall model. It is found that the estimated PCR model is based on principal components that are highly correlated with the previous day's maximum ozone concentration, nitrogen oxides concentrations and surface temperature. Furthermore, wind speed, wind direction, relative humidity and nitrogen dioxide are identified as possible causes of multicollinearity in the available data.

1. INTRODUCTION

Environment-related problems such as water and air pollution have attracted far greater research attention in the twenty-first century than ever before. Ambient (outdoor) air pollution is a major environmental health problem affecting everyone in developed and developing countries alike [1]. In particular, as pointed out by [2], the problem of air pollution in cities has become so severe that there is a need for timely information about changes in the pollution level. Nowadays, ozone is considered one of the most significant air pollutants, because it severely affects plant tissues and human health [3].

According to [4], NOx, VOCs (especially non-methane hydrocarbons, NMHC) and carbon monoxide (CO) are among the most important O3 precursors. Biomass burning is also a major source of trace gases and particulates [5]. According to [5], the CAPIA project uses a threshold value of 40 ppb to assess the potential risk of damage to maize by ozone, and measured data show that this threshold is exceeded over Botswana and on the South African Highveld. Also, [6] report that in Botswana monitoring is ongoing at Maun, where concentrations of 90 ppb and higher are not uncommon. These emission sources, coupled with meteorological conditions that may hinder the dispersion of ozone in the atmosphere or increase the production of pollutants, are conducive to the formation of ozone, suggesting that ozone concentrations over Botswana, particularly Maun, may be relatively high.

The influence of local climatic factors on ground-level ozone concentrations is an area of increasing interest to air quality management with regard to future climate change [7]. Several methods have been used for modelling ambient ground-level O3 concentrations. Among these methods, multiple linear regression has provided successful results in modelling studies [8]. A key assumption of a multiple regression model is that the predictor variables are not highly correlated with one another. However, multiple linear regression models often suffer from multicollinearity (collinearity in the predictors), a condition that is due to high correlations amongst some of the predictors in the model. The adverse effect of multicollinearity is that the resulting ordinary least squares (OLS) estimates of the regression coefficients of the correlated predictor variables tend to have large sampling variances, which makes the estimates unstable. Thus, multicollinearity can lead to spurious results and, consequently, to wrong conclusions.

To minimise the problem of multicollinearity, biased regression methods are applied. A biased regression method stabilises the partial regression coefficients of a model by introducing a small bias in exchange for a reduction in the variance of the estimated coefficients; the reduction in variance more than compensates for the increase in bias [9]. Among the most common biased regression methods in the statistical literature, namely PCR, ridge regression (RR) and partial least squares (PLS) regression, PCR appears to be the most widely employed in the atmospheric sciences.
For instance, [10] compare multiple linear regression, feed-forward artificial neural networks using principal components as inputs, and principal component regression to predict next-day hourly ozone concentrations, using as predictors the air pollutant concentrations of NO and NOx, on the one hand, and hourly means of surface temperature, wind velocity and relative humidity, on the other. [4] apply both multiple linear regression and principal component regression techniques to predict ozone concentrations at the ground level of the troposphere as a function of several air pollutants and meteorological parameters. More recently, [11] compare multiple regression and principal component regression techniques to forecast total column ozone over Peninsular Malaysia, with eight other ambient atmospheric parameters as predictor variables. Most recently, [12] use multiple regression analysis and principal component analysis techniques to develop models for the prediction of column ozone concentrations for Peninsular Malaysia, with a few selected atmospheric parameters as predictors.

Identifying an appropriate probability model to describe the stochastic behaviour of extreme ambient air pollution levels for a specific site or for multiple sites forms an integral part of environmental management and pollution control [13]. In Botswana, monitoring of ambient ground-level ozone by DWMPC is ongoing at a number of monitoring sites. However, the available literature does not show that there has been any systematic study to determine suitable statistical models for ambient ground-level ozone concentrations at any of the monitoring stations in Botswana. Therefore, the present study attempts to model, one day in advance, the daily maximum 1-hour average ambient ground-level ozone concentrations for Maun in the presence of precursor pollutants and local meteorological conditions.

2.  DATA  AND  RESEARCH  METHODOLOGY 

2.1.  Site  Description 

Maun is a tourist centre in Botswana and the gateway to the pristine Okavango Delta. The managers of air quality in the area are therefore under pressure from government, citizens and local businesses to maintain a high level of air quality, both for the health of the community and to keep the area attractive to tourists and to new business and industry. The town has shopping centres, hotels and lodges, car hire services and the busiest airport in Botswana. The monitoring site is located in an area of heavy motor vehicle traffic near the main bus and taxi station and the airport, where the amount of nitrogen oxides emitted into the atmosphere can be quite high. The station is selected because it has consistently recorded very high ground-level ozone concentrations. Furthermore, this particular station is chosen on the basis of the availability, reliability and good quality of its data. In particular, data availability for the site for the months of May through August 2014 was 100%. However, data for the remaining months of 2014 and for earlier years were not available.

2.2.  Data  Description 

The data used for this study consist of 123 daily maximum 1-hour average values of the ambient ground-level ozone concentration (O3); 3 pollutant variables, namely nitrogen oxides (NOx), nitrogen dioxide (NO2) and the previous day's peak ground-level ozone concentration (O3(t-1)), all measured in parts per billion (ppb) by volume; and 6 meteorological variables, namely wind speed (WS), measured in metres per second, wind direction (WD), in degrees, relative humidity (RH), in percent, surface temperature (T), in degrees Celsius, atmospheric pressure (P), in millibars, and solar radiation (R), in watts per square metre. These data were obtained from DWMPC, Ministry of Environment, Wildlife and Tourism, Botswana. Datasets from the DWMPC are considered to be of high quality, with no long periods of missing values over the study period.

2.3.  Principal  Component  Regression  Model 

For day $t$, $t = 1, 2, 3, \ldots, 123$, denote by $Y_t$ the maximum ground-level ozone concentration, with the predictor variables O3(t-1), WS, WD, T, RH, P, R, NOx and NO2 as defined before. Principal component regression is a regression technique that is based on principal component analysis (PCA). Without loss of generality, let us denote the original predictor variables by $X_1, X_2, \ldots, X_p$. PCA is a multivariate technique that seeks to explain the variance-covariance structure of the $p$ original variables, $X_1, X_2, \ldots, X_p$, through a few uncorrelated linear combinations, $Z_1, Z_2, \ldots, Z_p$, of the original variables. The basic idea behind PCR is to calculate the principal components and then use some of them as predictors in a linear regression model fitted using the ordinary least squares (OLS) method. PCR can minimise multicollinearity by excluding some of the low-variance principal components from the regression model.

To construct the PCR model in matrix form, we start with the usual multiple linear regression model given by Equation (1).

$$y_{(n \times 1)} = X_{(n \times p)}\,\beta_{(p \times 1)} + \varepsilon_{(n \times 1)} \qquad (1)$$

where $n$ is the sample size, $y$ is a vector of values of the dependent variable, $X$ is a matrix of predictor variables, which has a multivariate distribution with mean vector $\mu$ and covariance matrix $\Sigma$, $\beta$ is a vector of regression coefficients and $\varepsilon$ is the error term. The key assumptions of the model are (i) a linear relationship between the response variable and the predictor variables, (ii) the independent variables are not highly correlated with each other and (iii) the errors are independent, normally distributed random variables with a mean of zero and constant variance.

The OLS estimator of $\beta$ is given by $\hat{\beta} = (X'X)^{-1}X'y$. It must be noted that the variables considered in the present study are expressed in different units of measurement. To correct for the varying units and to allow the independent variables to be treated as equally important, all the independent variables are standardised by subtracting the mean of each variable and then dividing by the corresponding standard deviation, so that all the variables have mean zero and unit variance and, as a result of standardisation, $X'X = R$ (up to the constant factor $n - 1$). Effectively, this means that the principal component analysis is performed using the correlation matrix $R$ of the original data. Thus, to obtain the PCR model, we transform the original variables, $X_1, X_2, \ldots, X_p$, to the principal components, $Z_1, Z_2, \ldots, Z_p$, using Equation (2).

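$$Z'_{(p \times n)}\,Z_{(n \times p)} = P'_{(p \times p)}\,X'X\,P_{(p \times p)} = D_{(p \times p)} \qquad (2)$$

where $Z = XP$, $D$ is the diagonal matrix of the eigenvalues of $X'X$, $P$ is the eigenvector matrix of $X'X$ (and is orthogonal, as $P'P = I$) and $Z$ is a matrix of the principal components $Z_1, Z_2, \ldots, Z_p$, with successively smaller variances so that $\mathrm{var}(Z_1) \geq \mathrm{var}(Z_2) \geq \cdots \geq \mathrm{var}(Z_p)$, where $Z_j = e_{1j}X_1 + e_{2j}X_2 + \cdots + e_{pj}X_p$, $j = 1, 2, \ldots, p$, and $e_{ij}$ denotes the $i$th element of the $j$th eigenvector (the $j$th column of $P$).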
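The decorrelating effect of the transformation in Equation (2) can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed and all variable names are illustrative assumptions, not the Maun measurements): it standardises the columns, rescales them so that $X'X$ equals the correlation matrix, and confirms that the scores $Z = XP$ satisfy $Z'Z = D$.

```python
import numpy as np

# Synthetic stand-in for the predictor matrix (assumption: 123 rows, 4 columns).
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(123, 4))
X_raw[:, 1] += 0.9 * X_raw[:, 0]           # induce correlation between two columns

# Standardise, then divide by sqrt(n - 1) so that X'X is the correlation matrix R.
n = X_raw.shape[0]
Xs = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
X = Xs / np.sqrt(n - 1)
R = X.T @ X

# eigh returns eigenvalues in ascending order; reorder to decreasing variance.
eigvals, P = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
D = np.diag(eigvals[order])
P = P[:, order]

Z = X @ P                                   # principal component scores
ZtZ = Z.T @ Z                               # equals P'X'XP, which should be D
print(np.round(ZtZ, 6))
```

The printed matrix should be diagonal, with the eigenvalues of $R$ in decreasing order on the diagonal, which is what makes the components uncorrelated predictors.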

Therefore, the PCR model to be used for prediction of the maximum 1-hour average ground-level ozone concentration a day in advance, $y_t$, is given by Equation (3).

$$y_t = \alpha_1 z_{1t} + \alpha_2 z_{2t} + \cdots + \alpha_p z_{pt}, \quad t = 1, 2, \ldots, n \qquad (3)$$

where $z_{jt}$ denotes the score on the $j$th principal component on day $t$, $j = 1, 2, 3, \ldots, p$ and $t = 1, 2, \ldots, n$.
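The steps leading to Equation (3) can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `pcr_fit` and `pcr_predict` are hypothetical, the data are synthetic, and an intercept is included in the OLS step (Equation (3) is written without one).

```python
import numpy as np

def pcr_fit(X, y, k):
    """Fit a PCR model on the first k principal components of standardised X."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Xs = (X - mean) / std                       # mean zero, unit variance
    R = np.corrcoef(Xs, rowvar=False)           # correlation matrix of the predictors
    eigvals, P = np.linalg.eigh(R)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to decreasing variance
    eigvals, P = eigvals[order], P[:, order]
    Z = Xs @ P[:, :k]                           # scores on the retained components
    design = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # OLS on the scores
    return {"mean": mean, "std": std, "P": P[:, :k], "coef": coef, "eigvals": eigvals}

def pcr_predict(model, X_new):
    """Project new data on the retained components and apply the fitted coefficients."""
    Z_new = ((X_new - model["mean"]) / model["std"]) @ model["P"]
    return model["coef"][0] + Z_new @ model["coef"][1:]

# Synthetic example (assumption): x2 is nearly collinear with x1.
rng = np.random.default_rng(0)
n = 123
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.1 * rng.normal(size=n)

model = pcr_fit(X, y, k=2)                      # drop the low-variance third component
y_hat = pcr_predict(model, X)
```

Dropping the last component here removes the direction along which `x1` and `x2` are almost indistinguishable, which is the mechanism by which PCR stabilises the fit.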

2.4.  Detection  of  Multicollinearity 

2.4.1.  Examination  of  Partial  Correlation  Coefficient 

To examine relationships between environmental variables, which are inherently highly correlated, it is better to use pairwise partial correlation coefficients rather than ordinary correlation coefficients. Partial correlation coefficients indicate how much each variable uniquely contributes to the coefficient of determination, $R^2$, over and above that which can be accounted for by the other predictors. The partial correlation coefficient between variables $X_i$ and $X_j$, $i \neq j = 1, 2, \ldots, p$, holding all the other variables constant, is given by Equation (4).

$$r_{ij \cdot x} = \frac{-r^{ij}}{\sqrt{r^{ii}\,r^{jj}}} \qquad (4)$$

where $r^{ij} = r^{ji}$, $i \neq j = 1, 2, \ldots, p$, is the $(i,j)$th element of the inverse of the correlation matrix $R$ of $X$, $R^{-1} = [r^{ij}]$. Generally, a (partial) correlation coefficient of at least 0.80 is an indication that multicollinearity may exist.
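As an illustration of Equation (4), the sketch below computes all pairwise partial correlations from the inverse of the correlation matrix. The function name `partial_correlations` and the synthetic data are assumptions made for this example only; the example is built so that two variables are strongly correlated solely because both depend on a third.

```python
import numpy as np

def partial_correlations(X):
    """Pairwise partial correlations, each controlling for all remaining columns."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)      # -r^{ij} / sqrt(r^{ii} r^{jj})
    np.fill_diagonal(partial, 1.0)
    return partial

# Synthetic example (assumption): x1 and x2 are both driven by x3.
rng = np.random.default_rng(2)
x3 = rng.normal(size=500)
x1 = x3 + 0.3 * rng.normal(size=500)
x2 = x3 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

ordinary = np.corrcoef(X, rowvar=False)
partial = partial_correlations(X)
print("ordinary r12:", ordinary[0, 1])
print("partial  r12.3:", partial[0, 1])
```

In this construction the ordinary correlation between `x1` and `x2` is high while their partial correlation is close to zero, which shows why the two measures can lead to different conclusions about which predictors are genuinely related.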

2.4.2.  Variance  Inflation  Factor 

The VIF is a measure of how much the variance of the estimate of the ordinary least squares (OLS) regression coefficient is being inflated by multicollinearity. Given a set of $p$ predictor variables, $X_1, X_2, \ldots, X_p$, the VIF for the $j$th variable is obtained by regressing $X_j$ on all the remaining predictor variables in the model, and is given by Equation (5).

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (5)$$

where $R_j^2$ denotes the coefficient of determination of the regression of $X_j$ on all the other predictors. A value of $\mathrm{VIF}_j > 10$ (or a tolerance, $1/\mathrm{VIF}_j$, of less than 0.1) roughly indicates significant multicollinearity.
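Equation (5) translates directly into code. The sketch below is illustrative only: the function name `vif` and the synthetic predictors are assumptions, and each auxiliary regression includes an intercept.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2_j)
    return out

# Synthetic example (assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.2 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(np.round(vifs, 2))
```

The same values can also be read off the diagonal of the inverse correlation matrix, which links the VIF to the partial correlation computation in Equation (4).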


2.4.3.  Eigenvalues  of  the  Correlation  Matrix 

The eigenvalues of $R$, denoted by $\lambda_1, \lambda_2, \ldots, \lambda_p$ and usually ordered in decreasing magnitude, can be used to study the correlation structure of $R$, whose rank is given by the number of non-zero eigenvalues. When there is no multicollinearity, that is, when the predictors are mutually uncorrelated, the eigenvalues $\lambda_j$, $j = 1, 2, \ldots, p$, will all be equal to 1. A zero $\lambda_j$ indicates a linear dependency and an eigenvalue close to zero indicates a near linear dependency [14].
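A short numerical illustration of this diagnostic, on synthetic data chosen for the example (an assumption, not the study data): one column is almost an exact linear combination of the other two, so the smallest eigenvalue of the correlation matrix is close to zero. The ratio of the largest to the smallest eigenvalue is included as a common summary, although the text above does not use it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 123
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.01 * rng.normal(size=n)      # almost an exact combination of a and b
X = np.column_stack([a, b, c])

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # decreasing order
condition_number = eigvals[0] / eigvals[-1]

print("eigenvalues:", eigvals)
print("largest / smallest:", condition_number)
```

Because the trace of a correlation matrix equals the number of variables, the eigenvalues always sum to $p$; a near-zero eigenvalue therefore has to be offset by eigenvalues well above 1.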

2.4.4.  Determining  the  Number  of  Components 

One tool for determining the number of principal components to be used for further analysis is the cumulative proportion of variance that the principal components explain. The principal components that together explain an acceptable level of variance are retained; what is acceptable depends on the application. For example, if we want to perform further analyses on the data, we may want at least 90% of the variance to be explained by the retained principal components. We can also use the eigenanalysis method, which uses the size of an eigenvalue to determine whether its associated principal component should be included in further analyses. Under the Kaiser criterion, we use only the principal components with eigenvalues greater than 1.

A widely used tool for deciding on the number of principal components to retain in a PCA is the scree plot, obtained by plotting the ordered pairs $(j, \lambda_j)$, $j = 1, 2, \ldots, p$. The basic rule is to select the components on the steep part of the curve, before the first point at which the curve levels off into a roughly straight line.

2.4.5. Assessing Model Adequacy

To test whether there is sufficient evidence of a linear relationship between an individual predictor and the response variable, we test the hypothesis $H_0: \beta_j = 0$ against the alternative $H_1: \beta_j \neq 0$ for $j = 1, 2, \ldots, p$. Under $H_0$, the Student's t statistic is defined by Equation (6),

$$t = \frac{b_j}{se(b_j)} \sim t_{n-p-1} \qquad (6)$$

where $b_j$ is the estimate of $\beta_j$ and $se(b_j)$ is its standard error, and the null hypothesis is rejected in favour of the alternative if the absolute value of the calculated t is improbably large.

The F-test is used to check the validity of the overall model. That is, we test $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ against the alternative $H_1$: at least one $\beta_j \neq 0$, for $j = 1, 2, \ldots, p$. The test statistic is given by

$$F = \frac{MSR}{MSE} \sim F_{p,\,n-p-1} \qquad (7)$$

where MSE and MSR denote the mean square for error and the mean square for regression, respectively. The decision rule is to reject the null hypothesis in favour of the alternative if the calculated F value in (7) is improbably large. The overall model is further validated by considering the coefficient of determination ($R^2$), defined by

$$R^2 = \frac{\sum_{t=1}^{n} (\hat{Y}_t - \bar{Y})^2}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2} \qquad (8a)$$


where $Y_t$ is the observed ozone concentration at time $t$, $\hat{Y}_t$ is the predicted ozone concentration at time $t$, $\bar{Y}$ is the mean of the observed concentrations and $n$ is the number of observations. The coefficient of determination $R^2$ in Equation (8a) and the adjusted R-squared $R^2_{adj}$ given in Equation (8b) measure the proportion of the total variability in the response $Y$ explained by the fitted regression model. The adjusted R-squared provides an adjustment for the degrees of freedom.

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \qquad (8b)$$

where $n$ is the number of sample observations, $R^2$ is the coefficient of determination and $k$ is the number of independent regressors.
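The quantities in Equations (7), (8a) and (8b) can be computed from observed and fitted values as in the sketch below. The function name `fit_summary` and the synthetic regression are assumptions for illustration; the formulas presuppose an OLS fit that includes an intercept, for which the regression and error sums of squares add up to the total.

```python
import numpy as np

def fit_summary(y, y_hat, k):
    """Return R^2 (8a), adjusted R^2 (8b) and the overall F statistic (7)."""
    n = len(y)
    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_reg = np.sum((y_hat - y.mean()) ** 2)
    ss_err = np.sum((y - y_hat) ** 2)
    r2 = ss_reg / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    f_stat = (ss_reg / k) / (ss_err / (n - k - 1))    # MSR / MSE
    return r2, r2_adj, f_stat

# Synthetic example (assumption): two predictors, OLS with an intercept.
rng = np.random.default_rng(5)
n, k = 123, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
design = np.column_stack([np.ones(n), X])
y_hat = design @ np.linalg.lstsq(design, y, rcond=None)[0]

r2, r2_adj, f_stat = fit_summary(y, y_hat, k)
print(r2, r2_adj, f_stat)
```

The adjusted value is always below $R^2$ when $k \geq 1$ and $R^2 < 1$, since the factor $(n-1)/(n-k-1)$ exceeds 1.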

3.  RESULTS  AND  DISCUSSIONS 

3.1  Time  Series  Plot  of  the  Data 


Figure 1 shows a time series plot of the daily maximum 1-hour average ambient ground-level ozone concentrations for Maun for the study period. Most notably, during August 2014 the threshold value of 40 ppb, used in the CAPIA project to assess the potential risk of damage to maize by ozone, was exceeded over Maun about 10 times. The highest daily maximum 1-hour average O3 concentration of 53.079 ppb occurred on 28 August 2014 (day 120). Also, the month of August experienced the most variability in the daily peak ozone concentrations.

Figure 1: Time series plot of daily maximum ground-level ozone concentrations for Maun, 1 May - 31 August 2014 (horizontal axis: day)

A Principal Component Regression Model for Forecasting Daily Peak Ambient Ground Level Ozone Concentrations in the Presence of Multicollinearity Amongst Precursor Air Pollutants and Local Meteorological Conditions: A Case Study of Maun

Impact Factor (JCC): 3.9876
NAAS Rating 3.45

3.2.  Detecting  Multicollinearity 

3.2.1.  Examination  of  Pairwise  Partial  Correlation  Matrix 

From the matrix of pairwise partial correlations for the predictor variables (not presented in the paper), it can be seen that some of the predictor variables are highly correlated even after controlling for the effects of the other predictors. For example, when the other predictor variables are kept constant, the correlation between WS and WD is positive and very high ($r_{23 \cdot x} = 0.945$). A similar observation can be made about RH and NO2, with $r_{58 \cdot x} = 0.802$. Also noteworthy is that the partial correlation between NO2 and NOx is fairly strong ($r_{89 \cdot x} = 0.631$). This is to be expected, as nitrogen oxides (NOx) in the ambient air consist primarily of nitric oxide (NO) and nitrogen dioxide (NO2) (i.e., NOx = NO + NO2). These results show that some of the predictor variables in this study are highly correlated, which signals the possible existence of multicollinearity.

3.2.2.  Variance  Inflation  Factor 

The results in Table 1 clearly show that the VIF values for the variables WS, WD, RH and NO2 are each larger than the threshold value of 10, indicating that a high degree of multicollinearity exists and may be due to these variables. These results confirm the finding of the partial correlation analysis that multicollinearity exists amongst some predictor variables in this dataset. Therefore, to control and minimise the effects of multicollinearity, we model the Maun daily maximum ground-level ozone data using the principal component regression technique.

3.3.  Principal  Component  Regression 

For the reasons stated earlier, the predictor variables are standardised so that the principal component analysis is performed using the correlation matrix. Table 2 summarises the principal components computed from the correlation matrix.

3.3.1.  Determining  the  Number  of  Components 

The matrix of weights of the principal components computed from the correlation matrix is not provided in this paper for lack of space. However, it must be noted that the magnitudes of the coefficients also depend on the variances of the corresponding variables. To achieve a simple structure that makes the interpretation of the principal components as intuitive as possible, a rotation of the axes (dimensions) of the first few selected principal components is carried out using varimax, a popular orthogonal rotation method. We discuss the results of the varimax rotation of the five selected principal components (contained in Table 3) below, before fitting the PCR model.

To determine the minimum number of principal components that account for most of the variation in the data, we use the eigenanalysis of the correlation matrix and the scree plot, shown in Table 2 and Figure 2, respectively. In Table 2, the eigenvalues of the principal components show that the data have the largest variance along the axis of component 1, the second largest variance along the axis of component 2, and so on. Principal component $Z_1$ alone contributes 67.3% of the total variance in the original variables, followed by $Z_2$ with 22.2%. The principal components $Z_1$ through $Z_6$ have eigenvalues $\lambda_j > 1$. Thus, using the Kaiser criterion, we would retain only the first 6 principal components. However, $Z_6$ through $Z_9$ contribute very little or nothing to the total variance and, cumulatively, the first 5 components account for 98.9% of the variability in the original variables. Hence, on balance, only the first 5 principal components are retained for further investigation.
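The selection rules discussed above can be reproduced from the eigenvalues reported in Table 2. The sketch below only re-does the arithmetic; the helper name `components_needed` is hypothetical and the eigenvalues are copied from the table as printed.

```python
import numpy as np

# Eigenvalues as reported in Table 2 (PC1 to PC9).
eigenvalues = np.array([171.607, 56.620, 14.750, 5.304, 3.903,
                        1.866, 0.779, 0.090, 0.010])

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Kaiser criterion: count the components with eigenvalue greater than 1.
kaiser_count = int(np.sum(eigenvalues > 1.0))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative proportion reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

print("proportions:", np.round(proportion, 3))
print("cumulative: ", np.round(cumulative, 3))
print("Kaiser count:", kaiser_count)
print("components for 90%:", components_needed(0.90))
```

On these figures the Kaiser criterion keeps six components and a 90% cumulative-variance rule would keep three, while the five components retained in the text correspond to a cumulative proportion of about 0.989.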


To further help in choosing the number of principal components to include in the model, the scree plot for the predictor variables in the present study is given in Figure 2. Noticeably, an elbow occurs at dimension $j = 3$ in the scree plot, but the slope of the graph becomes fairly constant after $\lambda_5$, and all the following eigenvalues are relatively small and of about the same size. Therefore, it is decided that the first 5 principal components, $Z_1$ through $Z_5$, effectively explain the total variance in the original variables. Consequently, we fit a principal component regression model of O3 on the principal components $Z_1, \ldots, Z_5$.


Figure 2: Scree Plot of Precursor Ambient Air Pollutants and Meteorological Variables for Maun (horizontal axis: principal components)


As mentioned before, to aid the interpretation of the principal components, we performed a varimax rotation of the first five principal components. The rationale for the varimax rotation is to produce axes with a few large loadings and as many near-zero loadings as possible. Loadings are regarded as correlation coefficients between the original variables and their respective principal components, with the loading of variable $X_i$ on principal component $Z_j$ given by $l_{ij} = e_{ij}\sqrt{\lambda_j}$, $i, j = 1, 2, \ldots, p$. In the present study, the loadings are expected to be very similar to the principal component weights, since all the variables have equal (unit) variance. The results in Table 3 show that the first principal component, $Z_1$, correlates almost perfectly with O3(t-1); that is, it increases with the previous day's peak ground-level ozone concentration. So, this component mainly measures the background ground-level ozone concentration at the monitoring station. This result seems to indicate that ground-level ozone concentrations are persistent, so that the current concentration is most likely to build on the previous day's peak. The second principal component increases with decreasing concentrations of nitrogen dioxide (NO2) and, to a lesser extent, with decreasing relative humidity (RH). This means that the second component, $Z_2$, primarily measures the concentration of ambient nitrogen dioxide and the relative humidity at the monitoring site; thus, these two variables vary together. Similarly, the third component is highly correlated with only one predictor variable, NOx, indicating that $Z_3$ increases with nitrogen oxides. So, $Z_3$ primarily measures the ambient concentration of oxides of nitrogen at the station. Likewise, components $Z_4$ and $Z_5$ mainly measure wind direction and surface air temperature, respectively.

3.3.2.  Fitting  the  Principal  Component  Regression  Model 

A summary of the results of fitting the principal component regression model of O3 on the principal components $Z_1$, $Z_2$, $Z_3$, $Z_4$ and $Z_5$ is contained in Table 4. The large value of F (94.7), with a p-value < 0.001, is sufficient evidence that the overall model is valid. In addition, $R^2 = 0.8019$ and $R^2_{adj} = 0.7934$ both show that a high proportion (about 80%) of the variability in the concentration of ambient ground-level ozone in Maun is explained by variability in the principal components, chiefly $Z_1$, $Z_3$ and $Z_5$. However, the t tests for the significance of the individual components suggest that $Z_2$ and $Z_4$ are individually not significant. This means that their contribution to explaining O3 concentrations is not important and, therefore, they can be excluded from the final model. The results thus suggest that there are three important components for forecasting, a day in advance, the daily maximum 1-hour average ground-level ozone for Maun, namely $Z_1$, $Z_3$ and $Z_5$, each of which is highly significant with a p-value < 0.001. Therefore, the estimated PCR model for forecasting one day in advance the daily maximum 1-hour average ground-level ozone for Maun is given by Equation (9) as

$$O_3 = 0.83297\,Z_1 + 0.90783\,Z_3 + 0.91919\,Z_5 \qquad (9)$$
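As a consistency check on the summary statistics quoted above, the reported t values, adjusted R-squared and F statistic can be recomputed from the other figures in Table 4. The sketch below is plain arithmetic on the published numbers; it assumes that $k = 5$ components plus an intercept were fitted to $n = 123$ observations, as the 117 residual degrees of freedom indicate.

```python
# Estimates and standard errors copied from Table 4.
estimates = {
    "Z1": (0.83297, 0.04075),
    "Z2": (-0.08170, 0.07094),
    "Z3": (0.90783, 0.13899),
    "Z4": (0.05694, 0.23179),
    "Z5": (0.91919, 0.27019),
}

# Equation (6): t = estimate / standard error.
t_values = {name: est / se for name, (est, se) in estimates.items()}

# Equations (8b) and (7), using the reported R^2.
n, k, r2 = 123, 5, 0.8019
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

for name, t in t_values.items():
    print(name, round(t, 3))
print("adjusted R^2:", round(r2_adj, 4))
print("F:", round(f_stat, 1))
```

The recomputed values agree with the table to the precision shown, which supports reading the residual degrees of freedom as $n - k - 1 = 117$.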

4.  CONCLUSIONS 

This paper presents a principal component regression model for forecasting, a day in advance, the daily maximum 1-hour average ground-level ozone concentration for Maun, using precursor ambient air pollutants for ozone and local meteorological conditions as predictor variables. The model minimises the effects of multicollinearity amongst the predictors, which is inherent in environmental data, by excluding some of the low-variance principal components. It is found that the estimated PCR model is based on principal components that are highly correlated with three predictor variables: the previous day's ground-level ozone concentration, the concentration of nitrogen oxides and surface temperature. The estimated PCR equation is easy to implement and can be adapted to other air pollution monitoring stations around Botswana.

The study has two limitations. First, other important variables contributing to ground-level ozone formation, such as VOCs and carbon monoxide, are currently not measured at the station. Secondly, data for the summer months were not available, at least at the time of collecting the data for the study, and the available data cover only 4 months. Nevertheless, the results of this study highlight the important relationship between ground-level ozone concentrations and local ambient air pollution levels and meteorological conditions in the study area.


Table 1: Individual Multicollinearity Diagnostics for Air Pollutants and Meteorological Parameters

Variable    VIF
O3(t-1)     1.8336
WS          17.4160
WD          15.2253
T           1.5152
RH          17.6478
P           1.9463
R           6.0510
NO2         13.6290
NOx         1.9826

Table 2: Eigenanalysis of the Correlation Matrix (Eigenvalues and Proportions of Variance of the Principal Components)

Component   Eigenvalue   Proportion of Variance   Cumulative Proportion
PC1         171.607      0.673                    0.673
PC2         56.620       0.222                    0.895
PC3         14.750       0.058                    0.953
PC4         5.304        0.021                    0.974
PC5         3.903        0.015                    0.989
PC6         1.866        0.007                    0.997
PC7         0.779        0.003                    1.000
PC8         0.090        0.000                    1.000
PC9         0.010        0.000                    1.000

Table 3: Varimax Rotation

Variable    Z1       Z2       Z3       Z4       Z5
O3(t-1)     0.999    -        -        -        -
WS          -        -        -        -0.154   -
WD          -        -        -        -0.946   -
T           -        -        -        0.123    -0.930
RH          -        -0.566   -0.176   -        0.121
P           -        -        0.118    0.248    0.325
R           -        -        -        -        -
NO2         -        -0.812   0.253    -        -0.107
NOx         -        0.105    0.942    -        -

Table 4: Estimates of the Air Pollutants and Meteorological Parameters Based on the PCR Model

Variable    Estimate      Std. Error   t value   P-value
Intercept   -166.68938    95.15875     -1.752    0.082
Z1          0.83297       0.04075      20.442    <0.001 ***
Z2          -0.08170      0.07094      -1.152    0.252
Z3          0.90783       0.13899      6.532     <0.001 ***
Z4          0.05694       0.23179      0.246     0.806
Z5          0.91919       0.27019      3.402     <0.001 ***

Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.896 on 117 degrees of freedom. Multiple R-squared: 0.8019, Adjusted R-squared: 0.7934

F-statistic: 94.7 on 5 and 117 DF, p-value: < 0.001